Introduction


Breast  cancer  is  a  heterogeneous  disease,  consisting  of  at  least  five  intrinsic 
subtypes  including  the  luminal  A  and  luminal  B  (estrogen  receptor  alpha  positive; 
ESR+),  Her2+  (v-erb-b2  erythroblastic  leukemia  viral  oncogene  homolog  2  positive), 
basal (ESR-, Her2-), and normal-like patient groups1-3. These subtypes exhibit
distinct  differences  in  their  molecular  signaling  cascades,  stress  responses,  and  in 
the  types  of  cells  present  within  the  tumor.  For  example,  the  luminal  subtypes  of 
breast  cancer  display  a  strong  estrogen-signaling  component,  while  the  Her2+ 
subtype  reflects  the  downstream  response  of  receptor  tyrosine  kinase  activation. 
Although  such  subtype-related  pathways  and  responses  are  known  to  be 
biologically  and  clinically  relevant,  they  represent  only  a  fraction  of  total  cellular 
pathways  and  responses.  A  more  complete  understanding  is  needed  to  fully 
determine  the  reasons  for  treatment  failure  and  disease  recurrence.  To  date, 
however,  we  lack  a  comprehensive  analysis  of  those  processes  within  the  tumor  that 
are  associated  with  outcome  (or  other  histopathological/clinical  variables),  and 
whether  they  are  dependent  or  independent  of  the  tumor  subtype. 

Our  central  hypotheses  are  that  each  subtype  can  be  defined  as  a  collection  of 
molecular  processes,  that  there  exist  processes  that  can  be  used  to  predict  patient 
outcome regardless of subtype, and that there exists a disjoint set of processes that
predict  prognosis  within  each  subtype.  Moreover,  we  argue  that  the  identity  of  these 
processes  can  be  inferred  through  the  combined  use  of  our  de  novo  bioinformatics 
tool  entitled  Breast  Signature  Analysis  Tool  (BreSAT)  and  our  catalogue  of 
transcriptional  signatures  (entitled  BreSAT-DB)  that  have  been  collected  from 
literature  and  resources  such  as  GeneSigDB4  and  MSigDB5,  but  carefully  modified 
and  augmented  to  reflect  the  specific  biologies  of  the  breast  environment.  We  have 
applied BreSAT and its associated catalogue BreSAT-DB to thousands of breast
tumor  samples  and  models  of  the  disease.  This  has  allowed  us  to  identify  novel 
pathways,  processes,  responses,  and  cell  types  that  are  of  interest  to  disease 
progression  and  outcome,  in  addition  to  the  identification  of  highly  correlated 
processes that share few or no biological commonalities. These processes of
interest  were  largely  recapitulated  in  the  models  investigated  thus  far,  although  we 
identify  various  elements  with  relevance  to  the  human  disease  that  are  currently 
lacking  in  the  models.  These  discoveries  are  leading  to  a  more  comprehensive  and 
complete  view  of  breast  cancer  and  the  generation  of  more  accurate  disease  models. 

Body


Task  1.  Complete  course  requirements  (year  1): 
1a. BIOC 603: Genomics and Gene Expression (year 1).

All  required  PhD  coursework  was  successfully  completed  in  year  1.  Other  program 
requirements  to  date,  including  research  seminars  1  &  2  (junior  seminar  and  PhD 
proposal  respectively)  were  also  successfully  completed. 


Task  2.  Development  of  breast  cancer-specific  signatures  (year  1): 

2a. Acquire signatures from literature and databases (year 1).

2b.  Filter  collection  based  on  relevancy  (year  1). 

2c.  Agglomerate  signatures  representing  high  biological  similarity  (year  1). 

2d.  Refine  genes  according  to  behavior  in  breast-related  datasets  (year  1). 

Milestone #1 Publication (year 1).

A  major  component  of  our  framework  involves  the  collection  and  formatting  of 
molecular  signatures,  along  with  the  development  of  an  appropriate  ontological 
annotation.  We  have  termed  this  highly  curated  signature  database  Breast  Signature 
Analysis  Tool  Database  (BreSAT-DB).  Signatures  are  typically  a  set  of  genes  that 
have  been  determined  to  be  differentially  perturbed  in  response  to  either  a  specific 
molecular  event  (e.g.  overexpression  of  ESR),  or  are  markers  of  a  specific  cell  type 
(e.g.  macrophages  versus  pericytes  versus  endothelial  cells).  Signature  databases 
such  as  GeneSigDB4  and  MSigDB5  exist,  and  contain  thousands  of  such  signatures. 
However,  these  signatures  have  been  generated  in  a  variety  of  organisms,  tissues, 
cell  types,  and  with  different  techniques.  Thus,  many  of  these  signatures  may  not 
accurately  recapitulate  the  target  biology  in  human  clinical  breast  samples. 
Furthermore,  in  some  cases,  multiple  signatures  exist  for  what  are  meant  to  be  the 
same  biological  processes.  This  creates  challenges  downstream  in  the  analysis,  as 
separate  signatures  that  represent  the  same  general  process  or  cell  type  may 
contain  a  dissimilar  set  of  genes,  which  exhibit  different  expression  patterns  in 
human  breast  cancer  data,  and  ultimately  lead  to  contradictory  conclusions.  For 
these  reasons,  we  have  refined  and  annotated  thousands  of  available  signatures 
with  features  such  as  the  species  and  tissue  they  were  generated  in,  as  well  as  their 
general  category  (e.g.  whether  they  are  used  to  define  a  particular  cell  type, 
biological  response,  or  a  broad  prognostic  response).  Within  each  of  these 
categories,  the  signatures  are  further  sub-classified  as  appropriate  (e.g.  signatures 
that  define  biological  responses  are  sub-classified  into  one  of  ten  hallmarks  of 
cancer6).  Our  categorizations  are  intended  to  allow  for  the  first  broad  attempt  at 
comprehensively  dissecting  breast  tumors  into  a  set  of  individual  cellular  and 
mechanistic  components,  and  may  be  further  refined  and  expanded  by  the 
community  over  time.  BreSAT-DB  now  contains  approximately  6400  signatures, 
which  have  been  formatted  for  direct  computational  analysis  and  individually 
curated  according  to  features  of  interest  with  respect  to  breast  cancer. 

In addition, we have generated a data compendium now containing over 9000
human  patient  samples  related  to  breast  cancer,  along  with  their  associated 
histopathological/clinical  data.  Our  compendium  has  been  stratified  by  stages  of 
disease progression (e.g. normal tissue, DCIS, IDC, metastases, etc.), type of sample
(e.g.  whole  tumor  versus  cell-specific  tissue  derived  by  laser  capture 
microdissection),  adjuvant  and  neoadjuvant  treatments,  and  type  of  data  (e.g.  gene 
expression  microarrays,  aCGH,  miRNA,  etc.).  While  our  focus  has  been  on  human 
data,  we  also  have  a  sizable  compendium  of  models  for  the  disease,  including 
murine  tumors  and  human  cell  lines.  The  collection  involves  a  rigorous  process  of 
normalization  and  harmonization.  Clinical  parameters  must  be  carefully  matched  to 
determine,  for  example,  whether  recurrence  is  measured  as  a  local  or  distant  event 
that  takes  place  in  a  common  5-  or  10-year  time  frame.  This  ensures  that  clinical 
information  is  directly  comparable  from  one  dataset  to  the  next,  and  allows  us  to 
develop  automated  tools  for  analyzing  the  data. 

The collection and annotation of our database and compendium have been relatively
straightforward, albeit time-consuming. Year 2 saw minor updates to
the  size  of  the  database  (approximately  400  new  signatures  added),  and  further 
refinement  of  all  signature  annotations.  In  addition,  our  data  compendium  has 
expanded  to  include  thousands  of  additional  samples,  and  we  are  now  in  the  process 
of  collecting  data  from  other  platforms,  including  next  generation  sequencing. 
Unfortunately,  recent  publications  involving  signature  collection  and  analysis  by 
other groups7-9 required that we re-evaluate, re-write, and expand aspects of our
manuscript  in  order  to  differentiate  ourselves  and  highlight  the  unique  advantages 
BreSAT-DB  provides  for  breast  cancer  research.  This  includes  detailed 
demonstration  that  signatures  developed  in  the  breast  are  more  informative  than 
equivalent  signatures  developed  in  other  tissue  types,  when  applied  to  breast 
cancer  datasets.  Furthermore,  breast-derived  signatures  contain  genes  that  tend  to 
be more highly correlated with one another, suggesting that BreSAT-DB is more
accurate and appropriate than general-purpose signature databases for use in
breast  cancer  research.  An  updated  manuscript  detailing  the  production, 
composition,  and  utility  of  our  databases  is  now  written  and  undergoing  final 
modifications  for  submission. 


Task 3. Refinement of statistical methodology (year 1):

3a. Statistic for cohesiveness of subtypes (year 1).

3b. Statistic for association with survival/recurrence (year 1).

3c. Statistic for stability of sample ordering (year 1).

Given  a  panel  of  gene  expression  profiles  derived  from  breast  tumor  samples,  we 
typically  have  some  information  regarding  patient  clinical  attributes  including 
tumor  grade,  stage,  ESR  status,  Her2  status,  lymph  node  status,  and  ultimately 
patient  outcome  with  respect  to  disease  recurrence  and  overall  survival.  The 
canonical  example  of  a  question  that  is  asked  of  such  datasets  is  to  identify 
molecular processes and/or cell types in the tumor that differ between patients of
good and poor outcome. It is important to note that this approach assumes that
tumors can be broadly divided into these two groups before the analysis is
performed. Various bioinformatics tools such as GSEA5,10 exist for this type of analysis.
However,  the  heterogeneity  of  breast  cancer  suggests  that  a  simple  a  priori  partition 
of  the  patients  into  classes  such  as  good  and  bad  outcome  may  not  suffice.  This  is 
highlighted  by  the  enormous  differences  that  exist  between  subtypes,  and  the 
supposition  that  tumors  of  different  subtypes  recur  for  separate  reasons.  Indeed, 
previous  attempts  at  identifying  prognostic  predictors  of  breast  cancer  outcome 
have  largely  been  confounded  by  the  subtypes,  only  having  utility  in  a  subset  of 
patients11.  Our  observations  suggest  that  the  heterogeneity  of  breast  cancers  does 
not  allow  such  a  simple  dichotomy,  and  it  is  nearly  impossible  to  define  2  or  more 
such  classes  a  priori.  Moreover,  existing  tools  such  as  GSEA  have  a  limitation  in  that 
they  assume  that  a  process  is  significantly  differentially  modulated  between  the 
bipartition  of  the  patients.  That  is,  these  tools  look  for  sets  of  genes  with  high 
expression  in  one  category  but  low  expression  in  the  other.  We  argue  that  it  is  more 
natural  for  samples  to  display  a  range  of  activation  levels  for  a  given  signature.  This 
is  a  biological  reality  that  is  accepted  within  the  community,  but  often  ignored  by 
bioinformatics  methodologies.  For  example,  it  is  common  for  Her2  to  be  genomically 
amplified  one  or  more  times  in  breast  tumor  cells,  and  its  gene  expression  and 
membrane  protein  levels  increase  continuously  in  accordance.  This  increase  has 
been  directly  linked  to  a  corresponding  change  in  signaling  downstream  of  the 
receptor12.  Staining  of  Her2  by  immunohistochemistry  (IHC)  reveals  a  continuous 
range  of  intensities,  which  are  scored  from  0-3+  for  simplicity,  and  often  further 
reduced  to  simply  Her2-  or  Her2+.  While  tumors  are  often  summarized  by  a  simple 
discretization,  it  is  more  natural  for  human  breast  tumors  to  display  a  range  in 
signal  activation  levels  or  in  the  amount  of  various  cell  types  present;  bioinformatics 
methodologies  should  reflect  this  reality. 

To  overcome  this  problem,  we  have  designed  an  intuitive  approach  that  linearly 
orders  tumors  over  individual  signatures  (Figure  1),  thus  measuring  the  strength  of 
the  particular  response  or  cell  type  within  the  transcriptional  profile  of  a  tumor. 
Furthermore,  in  contrast  to  other  traditional  methodologies,  our  approach  does  not 
require  a  priori  that  tumors  be  binned  into  distinct  classes.  As  such,  the  tool  allows 
us  to  investigate  continuous  trends  across  the  data,  assessing  the  relative  activation 
of  signatures  across  a  panel  of  patients.  Using  statistical  approaches  we  have 
additionally  developed,  such  orderings  can  be  measured  for  robustness  and  other 
assessments  of  quality. 
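
The linear ordering described above can be illustrated with a minimal sketch. The report does not specify how BreSAT scores a tumor against a signature, so the scoring rule here (mean z-score of the signature's up-regulated genes minus that of its down-regulated genes) is an assumption chosen for simplicity. The function names and the toy expression values are likewise hypothetical.

```python
from statistics import mean, pstdev


def zscore_rows(expr):
    """Standardize each gene across samples; constant genes become all zeros."""
    standardized = {}
    for gene, values in expr.items():
        mu, sd = mean(values), pstdev(values)
        standardized[gene] = [(v - mu) / sd if sd > 0 else 0.0 for v in values]
    return standardized


def order_samples(expr, samples, up_genes, down_genes=()):
    """Return (sample, score) pairs sorted from lowest to highest score.

    Assumed score: mean z-score of 'up' genes minus mean z-score of 'down' genes.
    Genes missing from the expression table are ignored.
    """
    z = zscore_rows(expr)
    scored = []
    for j, sample in enumerate(samples):
        up = [z[g][j] for g in up_genes if g in z]
        down = [z[g][j] for g in down_genes if g in z]
        score = (mean(up) if up else 0.0) - (mean(down) if down else 0.0)
        scored.append((sample, score))
    return sorted(scored, key=lambda pair: pair[1])


# Toy data (hypothetical values): 3 genes measured in 4 tumors.
samples = ["T1", "T2", "T3", "T4"]
expr = {
    "ESR1": [1.0, 5.0, 3.0, 9.0],
    "GATA3": [2.0, 6.0, 4.0, 8.0],
    "KRT5": [9.0, 3.0, 5.0, 1.0],
}
ordering = order_samples(expr, samples, up_genes=["ESR1", "GATA3"], down_genes=["KRT5"])
print([sample for sample, _ in ordering])
```

No tumor is assigned to a class at any point: each one receives a position along a continuum, which is the property the text contrasts with bipartition-based tools.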

Since  thousands  of  signatures  are  being  employed,  and  each  one  generates  a  unique 
patient  ordering,  we  have  further  developed  statistical  tests  to  identify  those 
signatures  from  this  large  set  that  display  ‘interesting’  behavior.  The  definition  of 
‘interesting’  is  largely  dependent  on  the  particular  question  being  asked  of  the 
patient  dataset.  For  example,  given  a  transcriptional  signature  of  ESR  activation 
(that  is,  the  gene  set  corresponding  to  transcripts  that  are  differentially  expressed 
when  ESR  is  over-expressed),  patients  are  ordered  according  to  their  increasing 
relative expression of the signature. We may then ask whether the patient order is
consistent  with  other  assays  for  assessing  the  degree  of  ESR  activity,  including  for 
instance  IHC  staining  of  the  ESR  protein  (Figure  1).  Alternatively,  a  signature  may 
order  patients  in  such  a  way  that  associations  can  be  made  with  a  variety  of  other 
histopathological/clinical  parameters,  such  as  tumor  subtype  or  patient  outcome. 
The  development  of  statistics  to  identify  such  associations  is  not  trivial.  For 
example,  in  determining  an  association  with  patient  outcome,  the  tumor  ranks  could 
be  treated  as  a  continuous  variable  under  Cox  regression,  essentially  asking 
whether  an  increase  in  patient  rank  linearly  corresponds  to  a  change  in  patient 
outcome.  Alternatively,  the  patients  on  either  end  of  the  ordering  may  share  good 
prognosis,  with  the  patients  in  the  center  of  the  ordering  having  poor  outcome.  Both 
scenarios  present  relevant  information  about  how  a  process  or  cell  type  relates  to 
patient  prognosis,  but  they  require  different  means  of  analysis.  There  are  benefits 
and  drawbacks  to  the  various  approaches,  and  ultimately,  any  biological  conclusions 
depend  on  such  choices. 

We  have  successfully  developed  a  variety  of  statistics  that  are  able  to  determine 
associations  between  the  patient  ordering  and  discrete  clinical  variables  (such  as 
ESR  status  or  tumor  subtype),  continuous  variables  (such  as  age),  as  well  as  patient 
outcome.  In  addition,  we  have  developed  statistics  that  measure  the  stability  of  a 
patient  ordering  generated  by  a  particular  signature,  when  compared  against  the 
stability  generated  by  a  random  set  of  genes.  This  allows  us  to  filter  out  those 
signatures  that  are  less  trustworthy  in  the  data. 

The  type  of  statistic  described  thus  far  treats  each  signature  independently. 
However,  a  natural  question  arises  as  to  whether  dependencies  exist  between  the 
patient  orderings  generated  by  each  signature.  There  may  be  technical  reasons  for 
dependencies  between  signatures  (e.g.  they  have  many  genes  in  common),  or  there 
may  be  some  underlying  biological  reason.  For  such  a  set  of  signatures  that  order 
patients  in  a  similar  way,  we  wish  to  investigate  whether  they  also  tend  to  share 
associations  with  histological/clinical  parameters  and/or  functional  ontologies.  To 
investigate  this,  we  begin  by  calculating  the  correlation  between  every  pair  of 
patient  orderings,  and  use  this  information  to  build  a  graph  network  with  edges 
placed between nodes (signatures) that have a high correlation (Figure 2). Highly
interconnected  regions  of  the  graph  are  investigated  for  overrepresentations  in 
associations  with  available  histological/clinical  parameters.  This  is  not  simply  a 
technical  investigation,  but  one  with  biological  and  clinical  worth.  The  fact  that 
processes  are  correlated  tells  us  about  how  tumor  cells  respond  to  stress,  and  hints 
at  the  molecular  level  regulatory  interactions  that  take  place  in  tumor  progression. 
This  in  turn  suggests  better  stratification  of  patients  for  the  development  and 
success  of  new  treatment  targets.  Thus,  such  a  signature-network  approach 
identifies functionally related signatures, even when the signatures represent
different biological processes that share few or no genes in common.
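
The signature-network construction can be sketched as follows. The report states only that a correlation is computed between every pair of patient orderings and that edges join highly correlated signatures. The use of Spearman rank correlation, the 0.8 threshold, and the signature names and scores below are all assumptions made for illustration.

```python
from itertools import combinations


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either input has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def build_network(scores, threshold=0.8):
    """scores maps signature name -> per-patient scores (same patient order).

    Returns an adjacency dict with an edge wherever correlation >= threshold.
    """
    graph = {name: set() for name in scores}
    for a, b in combinations(scores, 2):
        if spearman(scores[a], scores[b]) >= threshold:
            graph[a].add(b)
            graph[b].add(a)
    return graph


# Hypothetical per-patient scores for three signatures over five patients.
scores = {
    "hypoxia": [0.1, 0.4, 0.3, 0.9, 0.7],
    "VEGF": [0.2, 0.5, 0.3, 0.8, 0.6],
    "ESR": [0.9, 0.2, 0.8, 0.1, 0.3],
}
network = build_network(scores)
print(network)
```

Because the edges depend only on how patients are ordered, two signatures with no genes in common can still be connected. Strongly anti-correlated pairs are ignored in this sketch; comparing the absolute value of the correlation to the threshold would include them.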


Task  4.  Application  of  framework  to  datasets  (year  1-3): 

4a. Apply signatures to human tumor datasets (year 1-2).

Milestone  #2  Publication  (year  2). 

Our  linear  ordering  procedure  has  been  repeated  for  signatures  within  our 
catalogue  BreSAT-DB,  across  a  compendium  of  ~200  ductal  carcinoma  in  situ  and 
~2000  invasive  breast  carcinomas,  for  which  clinically  annotated  whole  tumor  gene 
expression data were available. Appropriate tests were used to identify statistically
significant  associations  between  the  patient  ordering  generated  by  each  signature, 
and  histopathological/clinical  variables  including  intrinsic  subtype,  ESR  status,  Her2 
status,  lymph  node  status,  grade,  recurrence,  and  overall  survival.  An  interesting 
early  finding  was  that  the  large  majority  of  signatures  have  a  significant  association 
with  certain  clinical  variables,  such  as  ESR  status  and  the  tumor  subtype.  In  fact, 
even  random  sets  of  genes  tended  to  produce  significant  associations.  This  is  a 
testament  to  the  enormous  transcriptional  perturbations  that  occur  downstream  of 
specific  molecular  events,  including  activation  of  ESR.  To  compensate  for  this  trend, 
the  significance  of  an  association  with  a  given  molecular  signature  is  adjusted  by 
resampling  10,000  random  gene  sets  of  the  same  size. 
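
A resampling adjustment of this kind can be sketched as an empirical p-value. The report gives only the number of random gene sets (10,000) and that they match the signature in size. The association statistic, the add-one correction, and the toy gene weights below are assumptions for illustration.

```python
import random


def resampling_p_value(signature, all_genes, statistic, n_resamples=10_000, seed=0):
    """Empirical p-value of a signature's association statistic.

    `statistic` is any callable mapping a gene list to a number where larger
    means a stronger association (a hypothetical placeholder for the real test).
    Random gene sets of the same size as the signature form the null distribution.
    """
    rng = random.Random(seed)
    observed = statistic(signature)
    size = len(signature)
    hits = 0
    for _ in range(n_resamples):
        random_set = rng.sample(all_genes, size)
        if statistic(random_set) >= observed:
            hits += 1
    # Add-one correction (an assumption here) keeps the estimate above zero.
    return (hits + 1) / (n_resamples + 1)


# Toy example: 50 genes, of which only the first 5 carry any association weight.
weights = {f"g{i}": (1.0 if i < 5 else 0.0) for i in range(50)}
all_genes = list(weights)
signature = all_genes[:5]


def total_weight(gene_set):
    """Hypothetical association statistic: summed per-gene weight."""
    return sum(weights[g] for g in gene_set)


p_adjusted = resampling_p_value(signature, all_genes, total_weight, n_resamples=1000)
print(p_adjusted)
```

If random gene sets routinely score as high as the signature, as the text reports for ESR status, the empirical p-value approaches 1 and the signature is no longer flagged as significant.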

After  adjustment,  there  remained  a  large  number  of  signatures  consistently  having 
significant  associations  with  the  variables  tested,  and  there  was  a  surprising  overlap 
in the signatures that associate with any given variable (Figures 2 and 3). Thus far, we
have identified 239 signatures that consistently had a significant association with
molecular subtype in at least half of the datasets investigated (adjusted p-value ≤
0.05). Typically this association was the result of Luminal A and Basal tumors
having  vastly  different  patient  ranks.  In  addition,  207  signatures  were  found  to 
consistently  have  significant  associations  with  ER  status,  23  with  lymph  node  status, 
125  with  disease  recurrence,  and  116  with  overall  survival  (161  combined  total  for 
patient  outcome).  As  expected,  signatures  designed  to  predict  patient  outcome  in 
breast  cancer  patients  were  all  highly  significant  in  the  majority  of  datasets. 
Remarkably,  however,  we  have  been  able  to  identify  signatures  with  consistent, 
significant  associations  to  patient  outcome,  but  having  no  such  associations  to  any  of 
the  other  variables  tested.  These  are  signatures  that  encompass  a  variety  of 
processes, such as response to hypoxia, VEGF signaling, or activation of the
complement  immune  system.  Because  such  signatures  operate  independently  of 
known  histopathological/clinical  parameters,  they  represent  a  unique  class  with 
prognostic value across all subtypes, which contrasts with the types of predictors that are
in  clinical  use11.  This  is  an  important  milestone,  because  it  identifies  molecular 
markers  that  are  determinants  of  outcome  in  breast  cancer,  but  have  remained 
unrecognized  to  date.  The  identification  of  such  elements  is  essential  for  the 
development  of  new  classes  of  treatments.  Furthermore,  our  methodology 
represents  a  fundamentally  different  way  of  characterizing  breast  tumors.  Whereas 
traditional  approaches  segment  patients  into  classes  according  to  the  expression  of 
a  small  number  of  genes,  BreSAT  comprehensively  identifies  the  entire  set  of 
pathways,  processes,  responses,  and  cell  types  that  define  the  disease.  This 
exhaustive  cataloguing  of  the  molecular  differences  between  subtypes  is  providing  a 
more refined understanding, clinically and molecularly, of the underlying biology of
the  disease. 

As  there  is  some  indication  that  breast  tumors  of  each  intrinsic  subtype  represent 
distinct  biological  entities,  our  analysis  was  further  extended  to  observe  how 
signatures  associate  with  histopathological/clinical  variables  within  each  individual 
subtype.  BreSAT  was  applied  in  isolation  to  patient  sets  belonging  to  each  of  the  five 
intrinsic  subtypes,  and  statistical  associations  were  determined  as  before. 
Interestingly,  these  results  revealed  that  each  subtype  tends  to  favor  its  own  set  of 
signatures  (and  by  extension,  processes)  that  associate  with  patient  outcome.  The 
luminal  A  subtype  contained  the  largest  number  of  signatures  that  were  associated 
with  patient  outcome  (recurrence  and/or  overall  survival),  most  of  which  ordered 
patients  in  a  manner  that  was  independent  of  ER  status,  LN  status,  and  grade.  In 
contrast,  tumors  belonging  to  the  luminal  B  subtype  had  only  7  signatures 
consistently associated with patient outcome in at least half of the datasets tested.
Surprisingly, 5 of these 7 were signatures derived to specifically predict outcome in
breast cancer patients. This suggests that patients with luminal B tumors are
especially  good  candidates  for  therapeutic  decision-making  through  genomic 
predictors.  Tumors  within  the  ERBB2  and  Basal  subtypes  also  had  a  small  number 
of  associations  between  signatures  and  patient  outcome  (8  and  2  respectively), 
possibly  due  to  the  smaller  sample  size  of  these  subtypes.  These  associations  related 
to  processes  such  as  TGF-Beta  and  p21  in  the  ERBB2  subtype,  and  CK1  and  mRNA 
processing  in  the  Basal  subtype.  The  disparities  in  the  results  are  perhaps  not 
surprising,  as  the  patients  with  tumors  belonging  to  different  subtypes  tend  to 
receive  different  treatments  for  their  disease.  However,  our  results  are  particularly 
applicable  as  indicators  of  how  and  why  current  treatments  fail  in  different  subsets 
of  breast  cancer  patients. 

Such  results  support  our  hypotheses  that  breast  tumors  can  be  described  by  the 
activation/repression  of  various  molecular  signatures,  which  can  act  in  parallel  or 
orthogonally  to  a  tumor’s  intrinsic  subtype,  and  are  a  consequence  of  the  complex 
mix  of  cell  types  within  the  tumor.  To  better  understand  the  contribution  of 
different  cell  types  to  breast  tumor  biology  and  disease  outcome,  we  next  applied 
BreSAT  to  a  dataset  containing  microdissected  epithelium  and  stroma  tissue  from 
matched breast tumors (Figure 4). As before, statistical tests were used to identify
associations  between  signatures  and  histopathological/clinical  variables  of  interest. 
Because  the  process  was  performed  in  matching  tumor  epithelium  and  stroma,  we 
were  able  to  distinguish  between  signatures  that  are  macroenvironmental  (present 
in  all  compartments  of  the  tumor)  vs  those  that  are  microenvironmental  (present 
either  in  epithelium  or  stroma,  but  not  both).  Furthermore,  our  results  have 
revealed  that  some  subsets  of  patients  display  remarkably  similar  signature 
activation/repression  in  matched  tumor  epithelium  and  stroma,  whereas  other 
patient  subsets  are  enriched  in  microenvironment-specific  responses. 
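
The macroenvironmental versus microenvironmental distinction reduces to a set comparison, sketched below. The sketch assumes each compartment has already yielded a list of signatures with significant associations. The function name and the signature labels are hypothetical.

```python
def classify_by_compartment(epithelium_hits, stroma_hits):
    """Split significant signatures by the compartment(s) in which they appear.

    Signatures found in both compartments are macroenvironmental; those found
    in exactly one compartment are microenvironmental for that compartment.
    """
    epithelium, stroma = set(epithelium_hits), set(stroma_hits)
    return {
        "macroenvironmental": sorted(epithelium & stroma),
        "epithelium_only": sorted(epithelium - stroma),
        "stroma_only": sorted(stroma - epithelium),
    }


# Hypothetical lists of significant signatures from matched compartments.
groups = classify_by_compartment(
    epithelium_hits=["hypoxia", "proliferation", "ESR_activation"],
    stroma_hits=["hypoxia", "angiogenesis"],
)
print(groups)
```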

We  are  additionally  investigating  the  types  of  dependencies  that  exist  between 
signatures. By quantifying the correlation between all possible pairs of
signature-derived patient orders, we identify functional associations between signatures, even
when the signatures represent vastly different biological processes that share few
or no genes in common. Our analysis indicates that although there is an
overrepresentation  of  highly  correlated  signatures  with  a  significant  number  of 
genes  in  common,  there  additionally  exist  many  correlated  signature  pairs  with  no 
overlap.  We  identify  many  such  distinct  types  of  processes  and  cell  types  that 
appear to be highly correlated with one another, and are currently examining ways of
subdividing our collection of signatures into a core set of groups. The fact that many
processes are co-modulated suggests methods for building more robust and
accurate prognostic signatures that encompass a broader range of clinically
relevant characteristics with highly resilient signals.


Task  5.  Hypothesis-driven  generation  of  model  systems  (year  2-3): 

5a.  Selection  of  appropriate  cell  lines  and  mouse  models  (year  3). 

5b.  Molecular  engineering  of  models  (year  3). 

5c.  Analysis  of  modification  success  (year  3). 

Milestone  #3  Publication  (year  3). 

Several  hundred  samples  of  various  mouse  models  and  cell  lines  of  the  disease  have 
been  collected  and  formatted  into  our  compendium.  Our  linear  ordering  procedure 
has  been  repeated  for  all  ~6400  signatures  within  our  catalogue  BreSAT-DB,  thus 
identifying  which  models  have  repression  or  activation  of  processes  of  interest.  Not 
surprisingly,  the  cell  lines  are  largely  reflective  of  primary  breast  tumors  in  terms  of 
the  patterns  of  signature  activation.  For  example,  ESR  positive  cell  lines  tend  to 
display  activation  of  various  endocrine-related  signatures,  while  ESR  negative  cell 
lines  tend  to  display  activation  of  signatures  related  to  MAPK-induced  proliferation. 
Nonetheless,  cell  lines  differ  from  human  tumors  in  the  activation  of  various 
signatures.  For  example,  ESR  positive  human  tumors  display  activation  of  various 
signatures  related  to  cellular  adhesion  and  interaction  with  the  cellular 
microenvironment,  while  ESR  positive  cell  lines  do  not.  This  may  be  explained  by 
differences  in  the  physical  environment  of  the  two  sample  types.  As  changes  in  the 
breast microenvironment have been shown to have an effect on disease outcome, this
points  to  a  major  component  that  is  lacking  with  2-dimensional  serum-based  cell 
line  models. 

Initial  comparisons  between  human  breast  tumors  and  mouse  models  of  the  disease 
indicate  similar  trends;  while  individual  models  tend  to  share  molecular 
components  with  particular  human  subtypes,  the  similarities  are  imperfect.  For 
example,  over  all  ~6400  gene  sets  in  BreSAT-DB,  the  MMTV-Neu  model  has  an 
activation  pattern  that  is  highly  correlated  with  human  luminal  A  tumors  (Figure  5). 
Both  MMTV-Neu  murine  tumors  and  human  luminal  A  tumors  present  relatively 
high  levels  of  signatures  representing  E2F3  silencing  and  cell  cycle  arrest.  However, 
luminal  A  tumors  consistently  demonstrate  high  activation  of  signatures  relating  to 
ESR and other endocrine pathways, a property that is not shared by MMTV-Neu
mouse  tumors.  This  is  not  surprising,  given  that  human  luminal  A  tumors  tend  to  be 
ESR positive, while MMTV-Neu tumors are not. Furthermore, MMTV-Neu murine
tumors  display  activation  of  various  immune  components  that  are  not  shared  by 
human luminal A tumors (Figure 6). Together, these results indicate where the MMTV-Neu
murine model could be used to test hypotheses and treatments within the human
luminal A subtype and, of equal value, where it should not be used.

Work during year 3 of the project will continue to comprehensively map out the
similarities and differences between human breast tumors and models of the
disease. This systematic understanding will lead to the modification and generation
of more accurate and appropriate breast cancer models.

Key Research Accomplishments


•  Construction  of  a  comprehensive  and  highly  annotated  signature  database, 
specific  to  breast  cancer  (BreSAT-DB).  This  database  currently  holds  ~6400 
gene  sets,  a  large  proportion  of  which  were  developed  in  breast-related 
tissue. 

•  Collection  and  formatting  of  over  9000  data  samples  relating  to  breast 
cancer,  primarily  gene  expression  profiles  of  invasive  ductal  carcinoma 
(BreSAT-Compendium). 

•  The  generation  of  various  visual  and  statistical  methodologies  to  apply 
signatures  to  the  collected  datasets,  and  to  determine  the  significance  of 
associations  between  pathways,  processes,  responses,  or  cell  types,  and 
available  histopathological/clinical  parameters. 

•  Application  of  our  signatures  to  human  datasets,  testing  for  statistical 
associations  and  dependencies  between  signatures. 

•  Application  of  our  signatures  to  murine  and  cell  line  models  of  breast  cancer, 
using  the  developed  statistical  tests  to  determine  which  signatures  are  highly 
and  consistently  activated  in  individual  models. 

Reportable Outcomes

Poster 

Title: Breast Signature Analysis Tool (BreSAT): a framework for investigating the
molecular networks of breast cancer

Conference:  Era  of  Hope 

Location:  Orlando,  Florida 

Date:  August  2011 

Symposium  Presentation 

Title: Breast Signature Analysis Tool (BreSAT): a framework for investigating the
molecular networks of breast cancer

Conference:  Era  of  Hope 

Location:  Orlando,  Florida 

Date:  August  2011 

Workshop  Presentation 

Title:  Breast  Signature  Analysis  Tool  (BreSAT):  a  framework  for  investigating  the 
molecular  networks  of  breast  cancer 

Conference:  10th  Annual  McGill  Workshop  on  Bioinformatics  in  Barbados:  Systems 
Approaches  in  Translational  Breast  Cancer  Research 
Location:  Holetown,  Barbados 
Date:  January  2011 

Poster 

Title:  Breast  Signature  Analysis  Tool  (BreSAT):  a  framework  for  investigating  the 

molecular  networks  of  breast  cancer 

Conference:  RECOMB  Computational  Cancer  Biology  2010 

Location:  Oslo,  Norway 

Date:  June  2010 

Collection and normalization of breast-related data (BreSAT-Compendium)

In  total,  our  compendium  now  includes  over  9000  human  patient  samples  with 
associated  histopathological/clinical  data.  Our  compendium  has  been  stratified  by 
stages  of  disease  progression  (e.g.  normal  tissue,  DCIS,  IDC,  metastases,  etc.),  type  of 
sample  (e.g.  whole  tumor  versus  cell-specific  tissue  derived  by  laser  capture 
microdissection),  adjuvant  and  neoadjuvant  treatments,  and  type  of  data  (e.g.  gene 
expression microarrays, aCGH, miRNA, etc.). We are additionally in the process of
collecting and formatting thousands of next-generation sequencing profiles for use.
The  collection  involves  a  rigorous  process  of  normalization  and  harmonization. 
Clinical  parameters  must  be  carefully  matched  to  determine,  for  example,  whether 
recurrence  is  measured  as  a  local  or  distant  event  that  takes  place  in  a  common  5-  or 
10-year  time  frame.  This  ensures  that  clinical  information  is  directly  comparable 
from  one  dataset  to  the  next,  and  allows  us  to  develop  automated  tools  for  analyzing 
the data. While our focus has been on human data, we also have hundreds of
high-throughput samples representing models for the disease, including murine tumors
and human cell lines.
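
The matching of recurrence endpoints to a common time frame, as described above, can be illustrated with a minimal Python sketch (BreSAT itself is written in R). The record fields, the function name `recurrence_within`, and the rule for samples with short follow-up are assumptions made for illustration; they do not reproduce the compendium's actual schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClinicalRecord:
    """Hypothetical per-patient record; field names are illustrative only."""
    sample_id: str
    recurred: bool              # True if a recurrence event was observed
    event_type: Optional[str]   # "local", "distant", or None when no event
    followup_years: float       # time to event, or time to last follow-up


def recurrence_within(record, horizon_years, event_type="distant"):
    """Harmonize one record to a common endpoint.

    Returns True or False for recurrence of the requested type within the
    horizon, or None when follow-up is too short to decide.
    """
    if record.recurred and record.event_type == event_type:
        # A qualifying event counts only if it falls inside the time frame.
        return record.followup_years <= horizon_years
    # No qualifying event: informative only if followed for the full horizon.
    # (Simplification: events of the other type are treated as no event.)
    if record.followup_years >= horizon_years:
        return False
    return None
```

Applying the same function with `horizon_years=5` or `horizon_years=10` to every dataset yields labels that are comparable from one dataset to the next; samples returning `None` would be excluded from a binary analysis.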

Construction  of  signature  database  (BreSAT-DB) 

We collected, refined, and annotated ~6400 available molecular signatures with
features such as the species and tissue in which they were generated, as well as their
general category (e.g. whether they are used to define a particular cell type,
biological response, or a broad prognostic response). Within each of these
categories, the signatures are further sub-classified as appropriate (e.g. signatures
that define biological responses are sub-classified into one of ten hallmarks of
cancer [6]). While we have collected numerous available gene sets from public
databases,  we  have  additionally  focused  on  obtaining  signatures  from  the  literature 
that  were  specifically  generated  in  breast-related  tissues.  This  ensures  that  our 
signature  database,  BreSAT-DB,  comprehensively  and  accurately  reflects  those 
pathways,  processes,  responses,  and  cell  types  that  are  relevant  to  breast  cancer. 
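
One way to picture the annotation scheme described above is as a small record type with filters over its fields. The following Python sketch is illustrative only: the class name `Signature`, its field names, the category labels, and the helper `select` are assumptions and do not reflect how BreSAT-DB is actually stored.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Assumed labels for the three general categories named in the text.
CATEGORIES = {"cell_type", "biological_response", "prognostic"}


@dataclass(frozen=True)
class Signature:
    """Hypothetical annotated gene set."""
    name: str
    genes: FrozenSet[str]
    species: str                        # species the signature was generated in
    tissue: str                         # tissue the signature was generated in
    category: str                       # one of CATEGORIES
    subcategory: Optional[str] = None   # e.g. a hallmark for biological responses

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")


def select(signatures, *, species=None, tissue=None, category=None):
    """Return signatures matching every annotation filter that is not None."""
    wanted = {"species": species, "tissue": tissue, "category": category}
    return [
        s for s in signatures
        if all(v is None or getattr(s, k) == v for k, v in wanted.items())
    ]
```

For example, `select(signatures, species="human", tissue="breast")` would restrict an analysis to signatures generated in human breast tissue.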

Generation of a programming package in R for data analysis (BreSAT)

We have developed numerous computational methods to load breast-related
high-throughput data, to filter and visualize signatures of interest in the data, and
to quantify statistically the relevance of such applications. These functions have been
coded in the R programming language with a flexible design that allows them to be
used by other researchers with various data types. The code has been formatted as
an R package to be released for free through Bioconductor.
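
The general workflow of applying a signature to a dataset, ordering the samples, and testing the ordering against a clinical label can be sketched as follows. This is a simplified stand-in written in Python for illustration, not the BreSAT R code: the scoring rule (mean of up-regulated genes minus mean of down-regulated genes), the function names, and the use of a Mann-Whitney test with a normal approximation are all assumptions.

```python
import math
from statistics import mean


def signature_scores(expr, up_genes, down_genes=()):
    """expr maps gene -> list of per-sample expression values.

    Score per sample = mean of up genes minus mean of down genes
    (an assumed scoring rule, not BreSAT's ordering algorithm).
    """
    up = [expr[g] for g in up_genes if g in expr]
    down = [expr[g] for g in down_genes if g in expr]
    if not up:
        raise ValueError("no up-regulated signature genes found in the data")
    scores = []
    for i in range(len(up[0])):
        s = mean(row[i] for row in up)
        if down:
            s -= mean(row[i] for row in down)
        scores.append(s)
    return scores


def ranks(values):
    """Ranks from 1 (lowest) to n (highest); ties get their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out


def rank_sum_test(sample_ranks, labels, positive):
    """Two-sided Mann-Whitney U test (normal approximation, no tie
    correction) comparing `positive` samples against all others."""
    pos = [r for r, lab in zip(sample_ranks, labels) if lab == positive]
    n1, n2 = len(pos), len(sample_ranks) - len(pos)
    u = sum(pos) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    return u, math.erfc(abs(z) / math.sqrt(2))
```

With per-sample ranks in hand, the same ordering can be compared against any available histopathological or clinical label. The normal approximation is only reasonable for larger sample sizes.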

Conclusion


The  framework  we  have  described  is  a  novel  and  important  step  towards  better 
understanding  the  underlying  pathways,  processes,  responses,  and  cell  types  that 
influence  breast  cancer  progression  and  outcome.  Our  data  compendiums  represent 
the  largest  effort  we  are  aware  of  to  collect  high-throughput  breast-related  data  in 
an appropriately formatted and clinically annotated fashion. Similarly, our signature
database, BreSAT-DB, is the largest signature collection known to us, is
thoroughly annotated, and, crucially, is highly specific to breast cancer. Work will
continue  on  the  projects/manuscripts  currently  in  preparation;  we  aim  to  release 
our  framework  to  the  community  in  the  near  future. 

Our  analysis  with  the  BreSAT  framework  has  allowed  us  to  piece  together  the 
interplay  between  individual  molecular  signatures,  and  to  better  understand  how 
this  interplay  affects  the  phenotype  of  breast  cancer.  Our  methodology  introduces  a 
unique  and  intuitive  semi-supervised  approach  to  pathway  analysis,  and  is  robust 
when  multiple  disparate  high-throughput  datasets  are  used.  Crucially,  it  represents 
an entirely different way of classifying the disease. Instead of relying on the 'loudest'
molecular signals that make up the majority of a transcriptional profile, the status of
subtle but important biological pathways is taken into account. BreSAT provides a
means  to  comprehensively  determine  the  classes  of  responses  that  characterize 
patient  outcome,  regardless  of  the  confounding  effect  of  the  tumor  subtype. 

Our  analysis  of  primary  human  tumors  has  identified  numerous  processes  that 
influence  disease  progression  and  outcome.  In  a  similar  manner,  we  have  applied 
our  methodology  to  cell  line  and  mouse  models  of  the  disease.  This  has  allowed  us 
to  determine  which  models  best  reflect  individual  aspects  and/or  subgroups  of  the 
human  disease,  and  in  specifically  what  ways.  Furthermore,  we  have  identified 
important  aspects  of  human  breast  cancer  that  are  lacking  or  inconsistent  in 
available  models  of  the  disease.  Together,  this  work  provides  an  essential  step 
forward  in  understanding  the  molecular  components  that  are  involved  in  breast 
cancer. 

Figure 1. ESR activation signature in a breast cancer dataset. (A) Heatmap of ESR activation signature, with rows
representing  genes,  and  columns  representing  tumors.  Gene  expression  is  colored  from  green  (low)  to  red  (high). 
Samples  are  ordered  from  left  (least  ESR  signaling  activation)  to  right  (most  ESR  signaling  activation)  using  BreSAT. 
Arrow  indicates  increasing  signature  activation  in  the  tumors.  Patients  are  labeled  according  to  their  ESR  IHC  status 
(blue=positive), and their intrinsic subtype. (B) Patient ranks of the ESR- and ESR+ classes are displayed as
boxplots, and are significantly different (p-value = 1.6 × 10^-31). (C) Patient ranks of the intrinsic subtypes are displayed
as boxplots, and are significantly different (p-value = 2.0 × 10^-39). (D) Tumors were broadly divided in half according to
their ranks, and the Kaplan-Meier curve shows tumor recurrence of the two groups. The tumors with less ESR signaling
activation have significantly worse outcome (p-value = 2.8 × 10^-3). Expression data was obtained from [13]; ESR
activation  signature  was  obtained  from  [14]. 

Figure 2. Network view of correlations between signature orderings. Nodes represent each signature tested,
and  are  joined  by  edges  representing  the  highest  positive  1%  and  negative  1%  of  median  correlations  between 
signature  ordering  pairs  across  datasets.  Nodes  are  colored  according  to  the  proportion  of  datasets  where  they  have 
significant  associations  with  ESR  status  (A),  subtype  (B),  recurrence  (C),  and  the  overlap  (D). 
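
The edge selection described in the Figure 2 caption can be sketched in Python as follows. The caption does not state which correlation coefficient was used; this sketch assumes Pearson correlation computed on sample ranks (equivalent to Spearman correlation), and the function names and input layout are hypothetical.

```python
from itertools import combinations
from statistics import median


def pearson(x, y):
    """Pearson correlation; applied to ranks this equals Spearman's rho.
    Raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def network_edges(orderings, fraction=0.01):
    """orderings: {dataset: {signature: [rank of each sample]}}.

    Returns (positive_edges, negative_edges), each a list of
    ((signature_a, signature_b), median_correlation) taken from the top
    and bottom `fraction` of median correlations across datasets.
    """
    per_pair = {}
    for sig_ranks in orderings.values():
        for a, b in combinations(sorted(sig_ranks), 2):
            r = pearson(sig_ranks[a], sig_ranks[b])
            per_pair.setdefault((a, b), []).append(r)
    med = sorted((median(v), pair) for pair, v in per_pair.items())
    k = max(1, int(len(med) * fraction))
    negative = [(pair, r) for r, pair in med[:k] if r < 0]
    positive = [(pair, r) for r, pair in med[-k:] if r > 0]
    return positive, negative
```

Taking the median across datasets before thresholding keeps a single dataset from creating an edge on its own.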

Figure 3. Venn diagram representing significant clinical associations. Signatures must be significantly
associated  (adjusted  p-value<0.05)  with  ESR  status,  Her2  status,  intrinsic  subtype,  and/or  disease  recurrence,  in  at 
least  50%  of  datasets  tested.  21  signatures  were  found  to  be  uniquely  associated  with  recurrence. 
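
The selection rule in the Figure 3 caption (adjusted p-value < 0.05 in at least 50% of datasets tested) can be sketched in Python as below. The caption does not name the adjustment method; Benjamini-Hochberg correction applied within each dataset is assumed here, and the function names and input layout are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def consistent_signatures(pvals_by_dataset, alpha=0.05, min_fraction=0.5):
    """pvals_by_dataset: {dataset: {signature: raw p-value}}.

    Keep signatures whose adjusted p-value is below alpha in at least
    min_fraction of the datasets in which they were tested.
    """
    hits, tested = {}, {}
    for sig_to_p in pvals_by_dataset.values():
        names = list(sig_to_p)
        adj = benjamini_hochberg([sig_to_p[n] for n in names])
        for name, q in zip(names, adj):
            tested[name] = tested.get(name, 0) + 1
            hits[name] = hits.get(name, 0) + (q < alpha)
    return sorted(n for n in tested if hits[n] / tested[n] >= min_fraction)
```

Running the function once per clinical variable (ESR status, Her2 status, intrinsic subtype, recurrence) gives the sets whose overlaps a Venn diagram would display.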

Figure 4. Natural killer cell-mediated cytotoxicity activation signature in stromal and epithelial breast
tissue.  (A)  Heatmap  showing  laser  capture  microdissected  stromal  tissue  ordered  from  left  (representing  less 
activation  of  the  signature)  to  right  (representing  greater  activation  of  the  signature).  Samples  are  labeled  according 
to  their  intrinsic  stromal  subtype:  ER  high  (light  blue),  fibroblast-enriched  (green),  hypoxic  (red),  immune-enriched 
(purple),  matrix  remodeling  (yellow),  and  mixed  (dark  blue).  (B)  Heatmap  showing  laser  capture  microdissected 
epithelial  tissue  from  the  same  tumors  as  in  A,  and  labeled  according  to  their  intrinsic  stromal  subtype.  Boxplots  of 
the  patient  rank  distributions  for  immune-enriched  (purple)  and  all  other  samples  (gray)  in  stromal  tissue  (C)  and 
epithelial tissue (D). Immune-enriched stromal tissue shows significantly greater activation of the signature
(p-value = 8.99 × 10^-4), while the epithelial tissue does not (p-value = 0.560). Expression data was obtained from [15]; the
signature  was  obtained  from  [16]. 

Figure 5. Cross-species hierarchical clustering over ~6400 gene sets. The relative tumor ranks were determined
separately for samples in each dataset [17-18], and these ranks are used as features in the rows. The MMTV-Neu
mouse model clustered closely with human luminal A tumors (highlighted with blue rectangle). Heatmap is colored
from blue to red, representing least to greatest activation of each individual signature respectively.

Figure 6. Comparison of relative activation of signatures between human luminal A and murine MMTV-Neu
tumors.  (A)  Both  human  luminal  A  and  mouse  MMTV-Neu  display  high  activation  of  genes  downstream  of  E2F3.  (B) 
Luminal  A  tumors  display  high  activation  of  genes  representing  response  to  endocrine  signaling,  while  the  mouse 
tumors  do  not.  (C)  MMTV-Neu  tumors  demonstrate  a  high  transcriptional  response  associated  with  interferon 
activation,  while  the  human  tumors  do  not.  Datasets  are  from  [13,18],  while  signatures  were  obtained  from  [19-21]. 